import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics


class Machine:
    def __init__(self):
        self.denver = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_denver/dwellings_denver.csv')
        self.ml = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv')
        self.denver_samp = self.denver.sample(n=4999)

    def ml_model(self, feature_lst, classifier):
        ml = self.ml
        x = ml.filter(feature_lst)
        y = ml.before1980
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=.25, random_state=2003)
        # train the model
        classifier.fit(x_train, y_train)
        # make predictions
        y_predictions = classifier.predict(x_test)
        # pair each feature with its importance for charting
        feature_names = x.columns
        importances = classifier.feature_importances_
        importances_df = pd.DataFrame(
            {'Features': feature_names, 'Importances': importances})
        chart = alt.Chart(importances_df).mark_bar().encode(
            x=alt.X('Features'),
            y=alt.Y('Importances')
        ).properties(title='Features and their Importance')
        print(metrics.classification_report(y_test, y_predictions))
        return chart

    def explore(self, test):
        df = self.denver_samp
        chart = alt.Chart(df).mark_boxplot().encode(
            x=alt.X(test),
            y=alt.Y('yrbuilt', scale=alt.Scale(domain=(1850, 2050)))
        )
        return chart

    def opp(self, test):
        df = self.denver_samp
        chart = alt.Chart(df).mark_circle().encode(
            x=alt.X('yrbuilt', scale=alt.Scale(domain=(1850, 2050))),
            y=alt.Y(test)
        )
        return chart


model = Machine()
To set up this project, I first imported all the packages I would need. Then, to make experimentation quicker, I created a class whose method accepts different feature lists or classifiers and returns the metrics along with a chart showing how much each feature contributes to the predictions.
Using the stories chart we can see that one-story houses were mostly built before the 1980s, while three-story and four-story houses were largely built after.
Looking at the box plots we can get a better understanding of how the construction of different styles is distributed over time. This can be extremely helpful when predicting whether a house was built before or after 1980, because most styles are densely packed clearly before or after the 1980 line, with a few exceptions.
Predictive Model
As stated above, I collected the whole process into a method within a class that I created. I have found that this keeps things clean and organized, letting me plug and play with different features or classifiers. Here's what happens when I call my method.
features = ['numbaths', 'stories', 'livearea', 'gartype_None', 'quality_A',
            'quality_C', 'quality_D', 'quality_X', 'status_I',
            'condition_Good', 'sprice', 'arcstyle_ONE-STORY',
            'arcstyle_CONVERSIONS', 'arcstyle_ONE AND HALF-STORY',
            'gartype_att/CP', 'gartype_det/CP', 'condition_Excel',
            'condition_Fair', 'condition_AVG', 'arcstyle_BI-LEVEL',
            'arcstyle_TRI-LEVEL', 'arcstyle_TRI-LEVEL WITH BASEMENT',
            'arcstyle_TWO-STORY', 'totunits', 'finbsmnt']
model.ml_model(features, RandomForestClassifier())
When run, this method returns the metrics for this specific model and a graph showing which features are the most important, so I can compare the metrics and graphs before and after a change to decide which features to include or exclude while experimenting. As you can see, the accuracy in the report above is slightly above 90%.
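The importance graph is built from the classifier's feature_importances_ attribute. A minimal sketch of that step on synthetic data (the column names here are illustrative stand-ins, not the dwellings dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dwellings features and target
X, y = make_classification(n_samples=200, n_features=4, random_state=2003)
X = pd.DataFrame(X, columns=['livearea', 'sprice', 'numbaths', 'stories'])

clf = RandomForestClassifier(random_state=2003)
clf.fit(X, y)

# Pair each column with its importance, largest first
importances_df = (pd.DataFrame({'Features': X.columns,
                                'Importances': clf.feature_importances_})
                  .sort_values('Importances', ascending=False))
print(importances_df)
```

For a random forest these importances are normalized to sum to 1, so each value can be read as that feature's share of the model's decision-making.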
Features
As you can see in the graph above, our top three features by importance are the living area, the selling price, and whether or not the house is a one-story. Houses have generally been getting larger over time, so it makes sense that pre-1980 living areas would fall within certain bounds. Houses of similar ages also tend to sell for similar prices. And, as we saw in the exploratory analysis above, one-story houses were mostly built before 1980.
Metrics
Looking at the classification report above you can see several decimal values, which can be read as percentages. We are going to focus on precision and recall.
Precision
Precision measures the model's ability to flag only the relevant data points: of everything the model labeled positive, how much actually was. The precision for class 1 is 0.93, meaning that 93% of the houses the model predicted were built before 1980 actually were.
Recall
Recall is the model's ability to find all the relevant cases: how many true positives were identified out of all that should have been. The recall for class 1 is 0.93, meaning the model correctly found 93% of the houses that were actually built before 1980.
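As a concrete check on those definitions, here is a tiny hand-worked example with toy labels (not the model's actual predictions), using precision = TP / (TP + FP) and recall = TP / (TP + FN):

```python
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predictions for the "built before 1980" class (1 = yes)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# TP = 3 (pre-1980 houses correctly flagged), FP = 1, FN = 1
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
print(precision, recall)
```

Here the two scores happen to be equal because there is one false positive and one false negative; in the model above, both values for class 1 land at 0.93.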
Summary
Throughout this document we looked at which features matter in determining whether a house was built pre-1980, and along the way we used important machine learning principles and basics: determining which features to include, which classifier to use, and how to evaluate a model's performance with different metrics.